Building an annotated corpus for Amazighe

نویسندگان

  • Mohamed Outahajala
  • Lahbib Zenkouar
  • Paolo Rosso
چکیده

This paper gives an overview of the morpho-syntactic features of the Amazighe language and corpus encoding, afterwards we present our experience of constructing an annotated corpus with part-of-speech (POS) information. The annotated corpora consist of 20,667 Moroccan Amazighe tokens chosen from different materials; it is to our knowledge the first one dealing with Amazighe language. The experience is also meant to give a handle on the encoding and tagging processes of the aforementioned corpus.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

POS Tagging in Amazighe Using Support Vector Machines and Conditional Random Fields

The aim of this paper is to present the first Amazighe POS tagger. Very few linguistic resources have been developed so far for Amazighe and we believe that the development of a POS tagger tool is the first step needed for automatic text processing. The used data have been manually collected and annotated. We have used state-of-art supervised machine learning approaches to build our POS-tagging...

متن کامل

A Semi-Supervised Approach to Build Annotated Corpus for Chinese Named Entity Recognition

This paper presents a semi-supervised approach to reduce human effort in building an annotated Chinese corpus. One of the disadvantages of many statistical Chinese named entity recognition systems is that training data may be in short supply, and manually building annotated corpus is expensive. In the proposed approach, we construct an 80M handannotated corpus in three steps: (1) Automatically ...

متن کامل

Creating an Annotated Corpus for Generating Walking Directions

This work describes first steps towards building a system that synchronously generates multimodal (textual and visual) route directions for pedestrians. We pursue a corpus-based approach for building a generation model that produces natural instructions in multiple languages. We conducted an empirical study to collect verbal route directions, and annotated the acquired texts on different levels...

متن کامل

Studying Properties of Czech Complex Sentences from an Annotated Corpus

The paper deals with the problem of an analysis of complex sentences in Czech on the basis of manually annotated data. The availability of a specialized corpus explicitly describing mutual relationships between segments and clauses in Czech complex sentences, together with the availability of a thoroughly syntactically annotated corpus, the Prague Dependency Treebank, provide a solid background...

متن کامل

Building a Parallel Bilingual Syntactically Annotated Corpus

This paper describes a process of building a bilingual syntactically annotated corpus, the PCEDT (Prague Czech-English Dependency Treebank). The corpus is being created at Charles University, Prague, and the release of this corpus as Linguistic Data Consortium data collection is scheduled for the spring of 2004. The paper discusses important decisions made prior to the start of the project and ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013